Ignore freestanding ellipses #235

Enkidu93 · 2024-08-23T17:02:47Z

I think this is the simplest solution. Let me know what you think.

This change is

johnml1135 · 2024-08-23T18:12:11Z

tests/SIL.Machine.Tests/Corpora/UsfmMemoryTextTests.cs line 217 at r1 (raw file):

            @"\id MAT - Test
\c 1
\v 1 Verse text ... More text

What about things like this:

\v 1 ...
\q1 ...
\q2 ...
\v2 ... \f footnote \f*

I'm probably doing it wrong, but what is the defined behavior if the ellipses is between other USFM markers?

johnml1135 · 2024-08-23T18:14:23Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 241 at r1 (raw file):

                {
                    text = "";
                }

Should this happen before tokenization? I'm not quite understanding.

johnml1135 · 2024-08-23T18:22:23Z

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 174 at r1 (raw file):

        {
            // if we hit text in a verse paragraph and we aren't in a verse, then start a non-verse segment
            if (text.Trim().Length > 0 && text.Trim() != "...")

This seems like the wrong place to change it - this was added to solve a different issue. Does this change cause the tests to fail? It likely should...

johnml1135 · 2024-08-23T18:37:53Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 241 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Should this happen before tokenization? I'm not quite understanding.

This may work - but I don't understand this part of the parse to know where all the places that it will work. I also don't know how paratext populates \q1, and other poetry markers - nor to I know how this may handle populating section headers etc. with ... if that is an issue.

johnml1135

Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Enkidu93

To respond to all your comments in one place, @johnml1135, my goal was to treat verse's whose only text is '...' the same as white-space-only verses. This might be a hacky solution, but I'm not sure what the less hacky solution would be. I could certainly define a regex or something if we're interested in potentially treating other text as white space (e.g., only punctuation or something). Or maybe there's a way to think of this as a special kind of usfm token, but I also don't want to go overboard just for this small issue.

Judging from the project I was working with in which I originally saw this issue, it appears that '...' only appears as filler for an entire verse text - i.e., I didn't see examples of multiple poetic lines (or something similar) each with '...', but I'm not sure if those might exist as well since, like I said, I'm just basing this off one project.

Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit)

ddaspit

We are trying to handle the specific case where Paratext inserts ellipses to fill an empty verse. I think you should narrow the check even further and only apply it when you are creating text rows for a verse, i.e. UsfmTextBase.EndVerseText. I want to avoid incorrectly interpreting the intentional use of ellipses as empty text.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)

johnml1135 · 2024-08-26T16:35:22Z

I agree that changing the code there would prevent partial matches. @ddaspit - do you know all the instances in which the ellipsis might be added in Paratext?

Enkidu93

Like this, @ddaspit ?

Yeah, it would be nice to know what Paratext's behavior is such that we can make appropriately narrow and also so that we can have reasonable tests.

Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit)

codecov-commenter · 2024-08-26T16:47:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.81%. Comparing base (1220953) to head (0d743a7).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #235   +/-   ##
=======================================
  Coverage   69.80%   69.81%           
=======================================
  Files         379      379           
  Lines       31478    31482    +4     
  Branches     4399     4400    +1     
=======================================
+ Hits        21974    21978    +4     
  Misses       8483     8483           
  Partials     1021     1021

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

johnml1135 · 2024-08-26T16:58:29Z

src/SIL.Machine/Corpora/UsfmTextBase.cs line 276 at r2 (raw file):

            protected override void EndVerseText(UsfmParserState state, IReadOnlyList<ScriptureRef> scriptureRefs)
            {
                string text = _rowTexts.Pop().ToString();

I think a simple comment would be appropriate here - what are we doing and why? "Paratext sometimes auto-adds an ellipses for newly instantiated verses. This should be ignored for NLP tasks." or similar.

johnml1135

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

ddaspit

Let me check on the Paratext behavior and get back to you.

Reviewed 1 of 2 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

ddaspit

I can't figure out where the ellipses are coming from. I thought they were something that Paratext would insert in some cases, but I can't find any evidence of that. They might be coming from Adapt It, but I'm not sure how the ellipses would get back into the source USFM from Adapt It. If this isn't standard Paratext behavior, we shouldn't handle this in Machine. Instead, we should handle this during preprocessing in Serval.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Enkidu93

Maybe we can ask Michael about it tomorrow (?), @ddaspit, before committing to that switch. He seemed fairly sure that Paratext created the ellipses when the team made an empty book and my impression was that this was typical.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

ddaspit

I spoke to Michael about it. He doesn't know where they come from. I actually looked at the Paratext repo and found no trace of any code that adds ellipses. I looked at a couple of projects that contain them and can't find a consistent pattern.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Enkidu93

OK, so what's the plan here. Do you want to relocate this logic? What do you mean by preprocessing in Serval? There isn't really any file preprocessing is there, right?

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

ddaspit

We want to filter out any segments that are ... when generating the training corpus and pretranslation JSON file in PreprocessBuildJob.WriteDataFilesAsync.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Enkidu93

OK, just making sure I wasn't missing something 😆

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

Enkidu93 · 2024-08-27T19:55:21Z

Moving filtration to serval: sillsdev/serval#469

Ignore freestanding ellipses

7f02609

Enkidu93 requested review from ddaspit and johnml1135 August 23, 2024 17:02

johnml1135 reviewed Aug 23, 2024

View reviewed changes

Enkidu93 commented Aug 26, 2024

View reviewed changes

ddaspit requested changes Aug 26, 2024

View reviewed changes

Move check to text generation

0d743a7

Enkidu93 commented Aug 26, 2024

View reviewed changes

johnml1135 reviewed Aug 26, 2024

View reviewed changes

ddaspit reviewed Aug 26, 2024

View reviewed changes

Enkidu93 commented Aug 26, 2024

View reviewed changes

ddaspit reviewed Aug 26, 2024

View reviewed changes

Enkidu93 commented Aug 27, 2024

View reviewed changes

ddaspit reviewed Aug 27, 2024

View reviewed changes

Enkidu93 commented Aug 27, 2024

View reviewed changes

Enkidu93 mentioned this pull request Aug 27, 2024

Ignore ellipses sillsdev/serval#469

Merged

Enkidu93 closed this Aug 27, 2024

Enkidu93 deleted the ignore_ellipses branch August 27, 2024 19:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignore freestanding ellipses #235

Ignore freestanding ellipses #235

Enkidu93 commented Aug 23, 2024 •

edited by ddaspit

Loading

johnml1135 commented Aug 23, 2024

johnml1135 commented Aug 23, 2024

johnml1135 commented Aug 23, 2024

johnml1135 commented Aug 23, 2024

johnml1135 left a comment

Enkidu93 left a comment

ddaspit left a comment

johnml1135 commented Aug 26, 2024

Enkidu93 left a comment

codecov-commenter commented Aug 26, 2024

johnml1135 commented Aug 26, 2024

johnml1135 left a comment

ddaspit left a comment

ddaspit left a comment

Enkidu93 left a comment

ddaspit left a comment

Enkidu93 left a comment

ddaspit left a comment

Enkidu93 left a comment

Enkidu93 commented Aug 27, 2024

Ignore freestanding ellipses #235

Ignore freestanding ellipses #235

Conversation

Enkidu93 commented Aug 23, 2024 • edited by ddaspit Loading

johnml1135 commented Aug 23, 2024

johnml1135 commented Aug 23, 2024

johnml1135 commented Aug 23, 2024

johnml1135 commented Aug 23, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

Enkidu93 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Aug 26, 2024

Enkidu93 left a comment

Choose a reason for hiding this comment

codecov-commenter commented Aug 26, 2024

Codecov Report

johnml1135 commented Aug 26, 2024

johnml1135 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

Enkidu93 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

Enkidu93 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

Enkidu93 left a comment

Choose a reason for hiding this comment

Enkidu93 commented Aug 27, 2024

Enkidu93 commented Aug 23, 2024 •

edited by ddaspit

Loading